The regular grid is built by laying a 180 \(\times\) 240 grid over the city and keeping only the cells that fall inside the city boundary, which leaves 17472 cells (spatial resolution: 637 ft \(\times\) 576 ft per cell).
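As a rough illustration of the masking step, the sketch below lays a regular grid over a bounding box and keeps the cells whose centers lie inside a polygon. The helper names (`point_in_polygon`, `masked_grid`), the center-based inclusion rule, and the toy triangle are assumptions made for this example; the report does not state which inclusion rule was actually used.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon (list of (x, y) vertices)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def masked_grid(xmin, ymin, xmax, ymax, n_cols, n_rows, polygon):
    """Return the cell size and the centers of cells whose center is inside the polygon."""
    dx = (xmax - xmin) / n_cols
    dy = (ymax - ymin) / n_rows
    centers = []
    for r in range(n_rows):
        for c in range(n_cols):
            cx = xmin + (c + 0.5) * dx
            cy = ymin + (r + 0.5) * dy
            if point_in_polygon(cx, cy, polygon):
                centers.append((cx, cy))
    return (dx, dy), centers


# Toy example (assumed data): a right triangle inside a 4 x 4 bounding box
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
cell_size, kept = masked_grid(0.0, 0.0, 4.0, 4.0, 4, 4, triangle)
print(cell_size, len(kept))
```

With the real boundary polygon, the same idea would reduce the full 180 \(\times\) 240 grid to the cells inside the city.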

## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/xiaomuliu/CrimeProject/SpatioTemporalModeling/GISData/City_Boundary", layer: "City_Boundary"
## with 1 features
## It has 4 fields
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/xiaomuliu/CrimeProject/SpatioTemporalModeling/GISData/Street_Center_Line", layer: "Transportation"
## with 55747 features
## It has 48 fields
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/xiaomuliu/CrimeProject/SpatioTemporalModeling/GISData/Major_Streets", layer: "Major_Streets"
## with 16065 features
## It has 65 fields
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/xiaomuliu/CrimeProject/SpatioTemporalModeling/GISData/police_stations_poly", layer: "police_stations"
## with 26 features
## It has 6 fields
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/xiaomuliu/CrimeProject/SpatioTemporalModeling/GISData/School_Grounds", layer: "School_Grounds"
## with 1024 features
## It has 23 fields
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/xiaomuliu/CrimeProject/SpatioTemporalModeling/GISData/Parks_Aug2012", layer: "Parks_Aug2012"
## with 583 features
## It has 76 fields
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/xiaomuliu/CrimeProject/SpatioTemporalModeling/GISData/Hospitals", layer: "Hospitals"
## with 42 features
## It has 23 fields
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/xiaomuliu/CrimeProject/SpatioTemporalModeling/GISData/Libraries", layer: "Libraries"
## with 76 features
## It has 8 fields
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/xiaomuliu/CrimeProject/SpatioTemporalModeling/GISData/CTA_BusStops", layer: "CTA_BusStops"
## with 11308 features
## It has 13 fields
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/xiaomuliu/CrimeProject/SpatioTemporalModeling/GISData/CTA_Routes", layer: "CTA_Routes"
## with 130 features
## It has 7 fields
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/xiaomuliu/CrimeProject/SpatioTemporalModeling/GISData/CTA_RailLines", layer: "CTA_RailLines"
## with 154 features
## It has 9 fields
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/xiaomuliu/CrimeProject/SpatioTemporalModeling/GISData/Buildings", layer: "Buildings"
## with 820154 features
## It has 42 fields

Compute time-invariant (Spatial) features

Visualize feature maps

The “near-repeat effect” feature is computed by a 3D convolution with the kernel \[ K(x,y,t) = \frac{1}{2\pi \sigma^2}\exp\left(-\frac{x^2+y^2}{2 \sigma^2}\right) \cdot \exp(-\lambda t)\] where \(\sigma=2\) (in units of grid cells, each 637 ft \(\times\) 576 ft) and \(\lambda=0.1\) in these experiments.
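A minimal sketch of how such a feature map could be computed for one day is given below. It assumes an isotropic Gaussian in the squared distance \(x^2+y^2\) measured in grid cells, uses only the days preceding day t, and truncates the kernel in space (`radius`) and time (`max_lag`). The truncation values, the function names, and the toy data are assumptions for illustration, not settings taken from the experiments.

```python
import math


def kernel(dx, dy, dt, sigma=2.0, lam=0.1):
    """Isotropic Gaussian in space (cell units) times exponential decay in time (days)."""
    spatial = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)
    return spatial * math.exp(-lam * dt)


def near_repeat_feature(counts, t, sigma=2.0, lam=0.1, radius=6, max_lag=30):
    """Feature map for day index t, built from counts[s][row][col] of earlier days s < t."""
    n_rows, n_cols = len(counts[0]), len(counts[0][0])
    feature = [[0.0] * n_cols for _ in range(n_rows)]
    for s in range(max(0, t - max_lag), t):
        lag = t - s
        for r0 in range(n_rows):
            for c0 in range(n_cols):
                n = counts[s][r0][c0]
                if n == 0:
                    continue
                # Spread each past incident over the neighboring cells
                for r in range(max(0, r0 - radius), min(n_rows, r0 + radius + 1)):
                    for c in range(max(0, c0 - radius), min(n_cols, c0 + radius + 1)):
                        feature[r][c] += n * kernel(c - c0, r - r0, lag, sigma, lam)
    return feature


# Toy example (assumed data): one incident at the center of a 5 x 5 grid on day 0
days = [[[0] * 5 for _ in range(5)] for _ in range(3)]
days[0][2][2] = 1
f1 = near_repeat_feature(days, t=1)  # one day after the incident
f2 = near_repeat_feature(days, t=2)  # two days after the incident
```

In this toy case the map peaks at the incident cell and shrinks by a factor of \(\exp(-\lambda)\) from one day to the next.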

The features fall into four groups: 1. time-related, 2. weather, 3. space-related, 4. kernel-smoothed spatio-temporal neighboring points. The temporal and weather features do not vary over space, so for each of them we assign the same value to every pixel in the city.
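The assignment of city-wide values to every pixel can be sketched as follows. The feature names are borrowed from the coefficient tables in this report, but the numeric values and the `assemble_rows` helper are made up for illustration.

```python
def assemble_rows(day_features, pixel_features):
    """Pair one day's city-wide features with each pixel's spatial features."""
    return [dict(day_features, **px) for px in pixel_features]


# Assumed example values; only the feature names come from the report
day = {"MONTH": 7, "DOW": "Fri", "Tsfc_F_avg": 78.5}
pixels = [
    {"StrDen": 0.12, "Dist2Park": 300.0},
    {"StrDen": 0.40, "Dist2Park": 50.0},
]
rows = assemble_rows(day, pixels)
print(rows[0])
print(rows[1])
```

Every row for a given day carries identical temporal and weather values, and only the spatial columns differ between pixels.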

Since we rasterized the point data onto a dense grid, a bin that contains incidents holds exactly one most of the time. Occasionally, however, two or more incidents fall within the same bin at our resolution. We therefore ran two experiments: a three-class target, where 0 indicates no incident, 1 indicates one incident, and 2 indicates two or more incidents; and a two-class target, where 0 indicates no incident and 1 indicates at least one incident.
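The binning and the two target encodings might look like the sketch below. The function names, the unit cell size, and the toy points are assumptions for this example.

```python
def bin_counts(points, xmin, ymin, dx, dy, n_cols, n_rows):
    """Count incident points per grid cell; points outside the grid are ignored."""
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y in points:
        c = int((x - xmin) // dx)
        r = int((y - ymin) // dy)
        if 0 <= r < n_rows and 0 <= c < n_cols:
            grid[r][c] += 1
    return grid


def three_class_label(count):
    """0 = no incident, 1 = one incident, 2 = two or more incidents."""
    return min(count, 2)


def two_class_label(count):
    """0 = no incident, 1 = at least one incident."""
    return 1 if count > 0 else 0


# Toy example (assumed data) on a 3 x 3 grid with unit cells
points = [(0.2, 0.3), (0.8, 0.1), (1.5, 0.5), (2.5, 2.5), (2.6, 2.4), (2.9, 2.1)]
counts = bin_counts(points, 0.0, 0.0, 1.0, 1.0, 3, 3)
labels3 = [[three_class_label(n) for n in row] for row in counts]
labels2 = [[two_class_label(n) for n in row] for row in counts]
```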

The following two plots show an example of the kernel-smoothed neighboring points (near-repeat effect). The left plot is the near-repeat feature map for day t. The right plot is the binned map of the actual incident points on day t.

Using one year of training data (year 2014 here), we carry out a univariate analysis of each feature to assess how well it separates the classes. 1. The temporal features do not depend on location, so we use box plots to show how the city-wide incident counts vary over time. 2. The spatial and near-repeat effect features are shown as histograms for each class, with the corresponding smoothed density curves below the histograms.

The original training set consists of all pixels of the feature maps for year 2013, which gives 17472 \(\times\) 365 = 6377280 instances in total. Of these, only 17370 (about 0.27%) have nonzero class labels, so we down-sample the zero-class instances to deal with the class imbalance. Here we simply apply random down-sampling so that the zero class is roughly the same size as the nonzero classes combined.
The test data is all pixel maps of year 2014.
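Random down-sampling of the zero class can be sketched as below. The function name, the fixed seed, and the toy label vector are assumptions. The sketch keeps every nonzero instance and draws an equally sized random subset of the zero class without replacement.

```python
import random


def downsample_majority(labels, seed=0):
    """Return sorted indices: all nonzero-label instances plus an equal-sized random sample of zero-label ones."""
    rng = random.Random(seed)
    zero_idx = [i for i, y in enumerate(labels) if y == 0]
    nonzero_idx = [i for i, y in enumerate(labels) if y != 0]
    keep_zero = rng.sample(zero_idx, min(len(zero_idx), len(nonzero_idx)))
    return sorted(keep_zero + nonzero_idx)


# Toy example (assumed data): 1000 zero-class instances, 4 nonzero ones
labels = [0] * 1000 + [1] * 3 + [2] * 1
kept = downsample_majority(labels)
n_zero_kept = sum(1 for i in kept if labels[i] == 0)
print(len(kept), n_zero_kept)
```

Only the training set is resampled in this way; the test set keeps its original class proportions.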

Three-category case

LDA

## [1] "Confusion Matrix (LDA: Training)"
##             Predicted Label
## Actual Label     0     1     2
##            0 10873  6497     0
##            1  4901 12084    19
##            2    94   270     2
## [1] "Confusion Matrix (LDA: Testing)"
##             Predicted Label
## Actual Label       0       1       2
##            0 3983954 2379220     113
##            1    4215    9521       2
##            2      67     188       0

Multinomial logistic regression

## [1] "Confusion Matrix (Multinomial Logistic Regression: Training)"
##             Predicted Label
## Actual Label     0     1     2
##            0 10860  6510     0
##            1  4627 12377     0
##            2    76   290     0
## [1] "Confusion Matrix (Multinomial Logistic Regression: Testing)"
##             Predicted Label
## Actual Label       0       1       2
##            0 3950157 2413130       0
##            1    4056    9682       0
##            2      58     197       0

Two-category case

The feature set mixes categorical and continuous variables, so the continuous features are scaled to [0,1]. The test set is scaled with the scaling parameters estimated on the training set. Because the features share a common scale and the models are linear, the magnitudes of the coefficients can be compared as a rough indication of feature importance.
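A small sketch of this scaling scheme follows, assuming min-max scaling; the helper names and the toy values are made up. Note that test values outside the training range map outside [0,1]. Whether such values were clipped in the experiments is not stated, so the sketch leaves them as they are.

```python
def fit_min_max(train_column):
    """Estimate the scaling parameters on the training data only."""
    return min(train_column), max(train_column)


def apply_min_max(column, lo, hi):
    """Scale a column with previously fitted parameters (no clipping)."""
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Toy example (assumed values) for one continuous feature
train_temp = [20.0, 35.0, 50.0, 80.0]
test_temp = [10.0, 50.0, 95.0]
lo, hi = fit_min_max(train_temp)
train_scaled = apply_min_max(train_temp, lo, hi)
test_scaled = apply_min_max(test_temp, lo, hi)
print(train_scaled)
print(test_scaled)
```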

LDA

## [1] "Confusion Matrix (LDA: Training)"
##             Predicted Label
## Actual Label     0     1
##            0 10761  6609
##            1  4744 12626
## [1] "Confusion Matrix (LDA: Testing)"
##             Predicted Label
## Actual Label       0       1
##            0 3835298 2527989
##            1    3892   10101
##                         LD1
## MONTH2          -0.20490622
## MONTH3          -0.19694654
## MONTH4          -0.13778232
## MONTH5           0.06936944
## MONTH6           0.10379080
## MONTH7           0.08702186
## MONTH8           0.22104552
## MONTH9           0.07272137
## MONTH10          0.01947446
## MONTH11         -0.13873167
## MONTH12         -0.16747763
## DOWMon           0.36047119
## DOWTue           0.34205564
## DOWWed           0.39729809
## DOWThu           0.34533166
## DOWFri           0.43429798
## DOWSat           0.11283289
## HOLIDAY1        -0.41200865
## HOLIDAY2         0.22972375
## HOLIDAY3         0.01949629
## HOLIDAY4        -0.46687416
## HOLIDAY5        -0.26744721
## HOLIDAY6        -0.59129427
## HOLIDAY7        -0.09322696
## HOLIDAY8        -0.27063640
## HOLIDAY9         0.24745510
## HOLIDAY10        0.24373336
## HOLIDAY11       -1.17785090
## Tsfc_F_avg      -4.16203520
## Tdew_F_avg      -0.72960953
## Rh_PCT_avg       0.65701066
## Psfc_MB_avg     -0.45419082
## CldCov_PCT_avg   0.11875899
## Tapp_F_avg       4.74856379
## Spd_MPH_avg     -0.28284864
## Tsfc_F_max       1.74912078
## Tdew_F_max      -0.12963190
## Rh_PCT_max      -0.14494710
## Psfc_MB_max      0.62148081
## CldCov_PCT_max   0.06568887
## Tapp_F_max      -1.46097476
## Spd_MPH_max      0.14228001
## Tsfc_F_min       0.45956227
## Tdew_F_min      -0.02922019
## Rh_PCT_min      -0.23133573
## Psfc_MB_min     -0.29939221
## CldCov_PCT_min  -0.03123982
## Tapp_F_min      -0.45722650
## Spd_MPH_min      0.18428040
## PcpPrevHr_IN    -0.41888025
## StrDen           3.17084918
## Dist2Street     -0.37038070
## Dist2CPDstation -0.43641758
## Dist2School     -2.78269611
## Dist2Park        2.79578979
## Dist2Hospital   -2.13132746
## Dist2Library    -0.77706652
## Dist2BusStop    -7.08997920
## Dist2CTAroute    8.03536679
## Dist2CTArail    -0.24230627
## BldgDen          2.44078733
## NearbyEffect     9.24851174

Logistic Regression

## [1] "Confusion Matrix (Binomial Logistic Regression: Training)"
##             Predicted Label
## Actual Label     0     1
##            0 10692  6678
##            1  4413 12957
## [1] "Confusion Matrix (Binomial Logistic Regression: Testing)"
##             Predicted Label
## Actual Label       0       1
##            0 3790347 2572940
##            1    3722   10271
##                     Estimate Std. Error      z value      Pr(>|z|)
## (Intercept)      -0.45031096 0.22005315  -2.04637365  4.071962e-02
## MONTH2           -0.19078148 0.06554473  -2.91070689  3.606121e-03
## MONTH3           -0.17579395 0.06391923  -2.75025149  5.954954e-03
## MONTH4           -0.12193184 0.06892248  -1.76911577  7.687456e-02
## MONTH5            0.07704140 0.08246815   0.93419591  3.502028e-01
## MONTH6            0.09806720 0.08930624   1.09810018  2.721607e-01
## MONTH7            0.09104203 0.10033568   0.90737447  3.642088e-01
## MONTH8            0.22119612 0.09938362   2.22567985  2.603563e-02
## MONTH9            0.08145580 0.09077018   0.89738503  3.695135e-01
## MONTH10           0.02946502 0.07534631   0.39106116  6.957520e-01
## MONTH11          -0.13574785 0.09064655  -1.49755125  1.342499e-01
## MONTH12          -0.17566166 0.07801911  -2.25152090  2.435256e-02
## DOWMon            0.34019567 0.04880496   6.97051412  3.157848e-12
## DOWTue            0.32365370 0.04812280   6.72557952  1.748950e-11
## DOWWed            0.38408401 0.04820437   7.96782513  1.614912e-15
## DOWThu            0.33230721 0.04796125   6.92866026  4.248447e-12
## DOWFri            0.42009143 0.04749848   8.84431373  9.209253e-19
## DOWSat            0.10884121 0.04915330   2.21432164  2.680666e-02
## HOLIDAY1         -0.42048110 0.25507382  -1.64846824  9.925662e-02
## HOLIDAY2          0.25631783 0.25257230   1.01482954  3.101871e-01
## HOLIDAY3          0.02833811 0.26380813   0.10741941  9.144563e-01
## HOLIDAY4         -0.40895508 0.23850227  -1.71468005  8.640390e-02
## HOLIDAY5         -0.24770382 0.23591099  -1.04998845  2.937234e-01
## HOLIDAY6         -0.47630660 0.24383430  -1.95340281  5.077188e-02
## HOLIDAY7         -0.05359161 0.22955564  -0.23345806  8.154057e-01
## HOLIDAY8         -0.21252636 0.24862250  -0.85481546  3.926533e-01
## HOLIDAY9          0.22185598 0.25345211   0.87533690  3.813906e-01
## HOLIDAY10         0.23366364 0.26915605   0.86813448  3.853207e-01
## HOLIDAY11        -1.13589685 0.29164715  -3.89476409  9.829435e-05
## Tsfc_F_avg       -4.68005904 2.44937141  -1.91071841  5.604078e-02
## Tdew_F_avg       -0.52842403 1.38887674  -0.38046863  7.035976e-01
## Rh_PCT_avg        0.60512832 0.46163743   1.31083028  1.899151e-01
## Psfc_MB_avg      -0.55459548 0.39769820  -1.39451341  1.631626e-01
## CldCov_PCT_avg    0.12232861 0.11888517   1.02896438  3.034964e-01
## Tapp_F_avg        5.00032567 2.23302205   2.23926390  2.513875e-02
## Spd_MPH_avg      -0.22325369 0.24486304  -0.91174926  3.619007e-01
## Tsfc_F_max        1.91783674 1.01018919   1.89849263  5.763122e-02
## Tdew_F_max       -0.09195437 0.41049234  -0.22400996  8.227496e-01
## Rh_PCT_max       -0.15309337 0.14106577  -1.08526239  2.778055e-01
## Psfc_MB_max       0.65766167 0.26316674   2.49903036  1.245336e-02
## CldCov_PCT_max    0.06627429 0.09046835   0.73256881  4.638215e-01
## Tapp_F_max       -1.57732351 0.97288372  -1.62128678  1.049561e-01
## Spd_MPH_max       0.11525909 0.15387850   0.74902661  4.538412e-01
## Tsfc_F_min        0.40117420 0.95042110   0.42210153  6.729509e-01
## Tdew_F_min        0.02823478 0.42040401   0.06716107  9.464535e-01
## Rh_PCT_min       -0.24780658 0.22546999  -1.09906678  2.717389e-01
## Psfc_MB_min      -0.21526107 0.27190971  -0.79166379  4.285567e-01
## CldCov_PCT_min   -0.02344127 0.07813676  -0.30000308  7.641748e-01
## Tapp_F_min       -0.47296298 0.98278712  -0.48124662  6.303412e-01
## Spd_MPH_min       0.16601085 0.13199221   1.25773214  2.084886e-01
## PcpPrevHr_IN     -0.40487006 0.19142077  -2.11507910  3.442321e-02
## StrDen            2.22059989 0.23506472   9.44675951  3.494826e-21
## Dist2Street      -1.22373939 0.46438685  -2.63517233  8.409458e-03
## Dist2CPDstation  -0.39224389 0.17961458  -2.18380869  2.897631e-02
## Dist2School      -7.76357784 0.56065201 -13.84740935  1.318834e-43
## Dist2Park        -1.35826314 0.43526024  -3.12057712  1.804970e-03
## Dist2Hospital    -1.74922857 0.10219303 -17.11690681  1.110199e-65
## Dist2Library     -0.75412651 0.25117479  -3.00239736  2.678623e-03
## Dist2BusStop    -28.90400310 3.06186933  -9.43998584  3.728298e-21
## Dist2CTAroute    18.19773110 2.94798947   6.17292949  6.703611e-10
## Dist2CTArail      0.10858263 0.08905571   1.21926640  2.227431e-01
## BldgDen           2.17505657 0.05724182  37.99768328  0.000000e+00
## NearbyEffect      9.95994900 0.33040582  30.14459337 1.262740e-199

We compare our models with hot-spot methods, which consist of one long-term density model (preceding 365 days) and two short-term density models (preceding 7 days and 14 days). The incident points were first rasterized into pixel maps at the same resolution as the classification feature maps. Then a spatial Gaussian kernel of the form \(K(x,y) = \frac{1}{2\pi \sigma^2}\exp\left(-\frac{x^2+y^2}{2 \sigma^2}\right)\) was applied with \(\sigma=2\), consistent with the value used to create the near-repeat effect feature. The pixel values were then scaled to [0,1] so that they can be treated as probability predictions and ROC curves can be generated.
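To illustrate how scaled density values can be scored like classifier outputs, the sketch below scales a toy density map to [0,1] and computes the area under the ROC curve from its rank interpretation. It assumes min-max scaling; the function names and the data are made up, and the pairwise AUC computation is a simple stand-in that is only practical for small inputs.

```python
def scale_to_unit(values):
    """Min-max scale a list of density values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def auc(scores, labels):
    """Probability that a random positive scores above a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (assumed data): smoothed density per pixel and the observed labels
density = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
scaled = scale_to_unit(density)
print(auc(density, labels), auc(scaled, labels))
```

Because the scaling is monotone, it does not change the ranking of the pixels, so the AUC is the same before and after scaling.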

The training ROC curve

The testing ROC curve

AUC

| Model | Training | Testing |
|---|---|---|
| LDA | 0.74 | 0.72 |
| Logistic Regression | 0.75 | 0.73 |
| Long-term Density | 0.77 | 0.75 |
| Short-term Density 1 | 0.65 | 0.61 |
| Short-term Density 2 | 0.69 | 0.65 |

Here we show an example of the prediction results (the last evaluation example: 2014-12-31) using the different methods.